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Time series community genomics analysis reveals 
rapid shifts in bacterial species, strains, and phage 
during infant gut colonization 
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The gastrointestinal microbiome undergoes shifts in species and strain abundances, yet dynamics involving closely related 
microorganisms remain largely unknown because most methods cannot resolve them. We developed new metagenomic 
methods and utilized them to track species and strain level variations in microbial communities in 11 fecal samples collected 
from a premature infant during the first month of life. Ninety six percent of the sequencing reads were assembled into 
scaffolds of >500 bp in length that could be assigned to organisms at the strain level. Six essentially complete [-99%) and 
two near-complete genomes were assembled for bacteria that comprised as little as 1% of the community, as well as nine 
partial genomes of bacteria representing as little as 0.05%. In addition, three viral genomes were assembled and assigned 
to their hosts. The relative abundance of three Staphylococcus epidermidis strains, as well as three phages that infect them, 
changed dramatically over time. Genes possibly related to these shifts include those for resistance to antibiotics, heavy 
metals, and phage. At the species level, we observed the decline of an early-colonizing Propionibacterium acnes strain similar to 
SK137 and the proliferation of novel Propionibacterium and Peptoniphilus species late in colonization. The Propionibacterium 
species differed in their ability to metabolize carbon compounds such as inositol and sialic acid, indicating that shifts in 
species composition likely impact the metabolic potential of the community. These results highlight the benefit of 
reconstructing complete genomes from metagenomic data and demonstrate methods for achieving this goal. 

[Supplemental material is available for this article.] 



The gut microbiome plays a critical role in determining human 
health. Gut microbes influence immune system development and 
function (Maslowski et al. 2009; Lathrop et al. 2011), process nu- 
trients, and affect energy uptake by the host (Hooper and Gordon 
2001; Ley et al. 2005). Inflammatory bowel disease (Xavier and 
Podolsky 2007) and necrotizing enterocolitis (Fell 2005) have been 
linked to abnormalities in the composition of the gut microbiome 
and inappropriate activation of the immune system responses di- 
rected against the gut microbiome. The vast majority of the human 
microbiota reside in the gastrointestinal (GI) tract (Savage 1977). 
The genomes of these microorganisms encode more than 3 million 
unique genes, more than two orders of magnitude larger than the 
number of genes in the human genome (Qin et al. 2010). More 
than 1000 species have been detected in the gut, with represen- 
tatives from at least nine bacterial phyla and about 150 dominant 
species in the gut of each individual (Eckburg et al. 2005; Ley et al. 
2006; Qin et al. 2010). 

Various factors may cause rapid changes within the gut mi- 
crobial community, including the availability of nutrients, drug 
consumption, phage blooms, and the presence of opportunistic 
pathogens (Reyes et al. 2010; Dethlefsen and Relman 201 1; Fukuda 
et al. 2011; Gophna 2011). These changes may be particularly 
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significant during the initial microbial colonization of the infant 
gut when the community is usually comparatively simple, some- 
times consisting of only a few dominant species (Koenig et al. 
2011; Morowitz et al. 2011). It has been previously suggested that 
biodiversity, which is particularly manifest at the species and strain 
level in the gut microbiome, can maintain or enhance ecosystem 
functioning in the face of environmental change (Backhed et al. 
2005). Several studies published in recent years have demon- 
strated the significant functional differences associated with 
strain variation in medically relevant commensals. For example, 
only certain strains of Bifidobacterium longum were shown to 
provide protection against pathogens such as Escherichia coli 
due to the presence of genes encoding ATP-binding-cassette-type 
carbohydrate transporters (Fukuda et al. 2011). Virulence of 
Staphylococcus epidermidis varies among different strains (Gill 
et al. 2005) as does the ability to inhibit Staphylococcus aureus 
biofilm formation (Iwase et al. 2010). 

As the significant role of species and strain level variation is 
increasingly recognized, it is clear that a better understanding 
of these variations in natural environments will enhance our 
understanding of the dynamics of microbial communities. Un- 
fortunately, techniques used to date fall short of providing the 
strain-level resolution required for a comprehensive description 
of microbial communities. 16S rRNA gene surveys are widely used 
for characterizing microbial communities (Palmer et al. 2007; 
Mshvildadze et al. 2010; Jacquot et al. 2011; LaTuga et al. 2011; Mai 
et al. 2011) and may distinguish different species. Sequencing of 
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microbial genomic DNA recovered from microbial community 
samples has the potential to provide information at the strain 
level. However, metagenomic analysis of unassembled or partially 
assembled data cannot provide information about the role of in- 
dividual community members. In contrast, assembled and well- 
curated genomes reconstructed from such samples can provide 
metabolic insight at the species and strain levels. 

In a recent paper (Morowitz et al. 2011), metagenomic data 
were used to provide a detailed description of a microbial com- 
munity during gut colonization. Using 245 Mbp of pyrosequenc- 
ing data, it was possible to reconstruct, de novo, the genomes of 
the dominant Serratia and Citrobacter species and a few plasmids 
and bacteriophages. The study revealed strain-level genotypic 
features that differentiated two Citrobacter strains and that may 
have explained strain abundance fluctuations over time. The study 
was limited in its ability to characterize less abundant organisms, 
yet, these organisms may have major significance for overall com- 
munity function. The Illumina sequencing technology offers new 
opportunities for comprehensive study of microbial communities 
and the forces that govern their development. The orders-of- 
magnitude-greater amount of data from this technology make 
longitudinal time series studies with deep sampling of rare mem- 
bers possible, thus offering opportunities for the in-depth study of 
microbial community dynamics. Subject to sufficient sequencing 
coverage, the availability of accurate methods for the assembly of 
short read sequencing data, as well as methods for binning with 
fine resolution and high sensitivity, it should be possible to re- 
construct complete genomes of all members in the community 
and to identify factors that explain their abundance. 

In this paper, we present a time series study of relatively low 
complexity gut microbiome samples from a newborn premature 
infant. We developed novel approaches for sequence assembly and 
binning, and in so doing, were able to reconstruct complete and 
partial genome sequences of organisms with relative abundance 
as low as 0.05% of the community. This outcome indicates that 
metagenomic data have essentially the same potential to yield 
near-complete genomes as data from cultivated isolates. Our re- 
sults provide insight into the metabolic differences associated with 
species and strain shifts and evidence for phage-based control of 
the strain composition of medically important species. 

Results 

Sample collection 

We studied 11 fecal samples collected on post-natal days 15-24 
from a female infant delivered by Caesarean section at 26 wk of 
gestation. Fecal samples were collected once or twice daily, as 
available, between days 15 and 24. This time frame was selected for 
study because it represents a critical colonization period (and one 
during which abnormalities can lead to neonatal necrotizing 
enterocolitis, although the latter did not develop in this infant) 
(Morowitz et al. 2010). See Methods for further details. 

Microbial genome reconstruction 

Raw data for this study consisted of —260 million 100-bp paired- 
end Illumina HiSeq reads from libraries with insert sizes of 400 
(four samples) and 900 (seven samples) bp from the samples col- 
lected at the 11 time points (—2.4 Gbp per sample) (see details in 
Supplemental Table SI). A novel pipeline was developed for ana- 
lyzing the data in this study (Fig. 6, see below). The pipeline was 



designed to reconstruct high-quality complete genomes for those 
with coverage sufficient to ensure complete sampling and to assign 
all genome fragments longer than 500 bp to organisms accurately. 
A detailed description of the methods is provided in the Methods 
section and supporting online material. 

We developed a new iterative approach in which different 
groups of genomes with similar levels of abundance are assembled 
simultaneously, separately from the rest of the data. This approach 
largely circumvents the complication that single genome assem- 
blers (e.g., Velvet) (Zerbino and Birney 2008) encounter when 
mixtures of genomes with different levels of coverage are assem- 
bled using parameters suitable for just one coverage level. Data 
from all samples were coassembled together. We also developed 
new high-throughput processes for detecting assembly errors, re- 
moving gaps from the assembly, and a program for post-assembly 
manual genome curation that guides manual inspection to im- 
prove assemblies (e.g., where strain variation splits contigs). Typ- 
ically, assembly errors in this study involved inappropriate linkage 
between fragments from the same genome, not chimeras in- 
volving genome segments from multiple organisms. 

A critical component of the community metagenomic anal- 
ysis is the assignment of genome fragments to clusters represent- 
ing operational taxonomic units (OTUs), a process termed bin- 
ning. Accurate binning is required for comparative genome and 
metabolic analysis. Here, we binned the data based on time series 
abundance profiles. Specifically, each scaffold is represented based 
on its relative abundances in the 1 1 samples and clustered with 
other scaffolds with similar abundance profiles. We used the 
Databionics implementation of ESOM (Ultsch and Moerchen 
2005) to cluster all scaffolds longer than 500 bp, using abundance 
profile information (Fig. 1). This resulted in well-defined clusters of 
fragments with the same time series abundance patterns that 
represented either genomes or sets of sequences that belong to one 
of several strains (as in the case of S. epidermidis) (Fig. 1). 

Genome completeness 

The fraction of the genome that was assembled was evaluated for 
all genomes based on the presence of a set of universal single copy 
genes (Raes et al. 2007) and other conserved genes (Supplemental 
Fig. S9), as well as scaffold connectivity and genome size compar- 
ison to closely related, published genomes (refer to the Methods 
section for more details). 

Supplemental Table S6 summarizes genome completeness. 
Complete genomes were reconstructed for four phages. One 
hundred percent connectivity was achieved for the genomes 
of Enterococcus faecalis, a Peptoniphilus species, a Propionibacterium 
species, and S. epidermidis strain 3. The genome of S. epidermidis 
strain 1 had 18 disconnected scaffold ends, all of which could be 
resolved with high confidence. These genomes are considered to 
be 99% complete ("essentially complete"). We refer to genomes 
estimated to be more than —80% complete as "near complete". 
Other genomes are considered to be "partial." 

Species abundance changes and potential significance 

Figure 2 describes time series abundance patterns for all genomes 
detected in our data set. Overall, the community was dominated 
by E. faecalis (phylum Firmicutes, family Enterococcaceae) , a facul- 
tative anaerobic Gram-positive bacterium often detected in feces 
(Morowitz et al. 2011). Other dominant species included skin- 
associated microbes, consistent with previously observed microbial 
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Figure 1. Time series-based ESOM binning of the whole data set (left) and Staphylococcus scaffolds 
(right). (Left) An ESOM in which each point represents a 500- to 6000-bp segment. Data points are 
colored based on their scaffold's best hit from all published genomes in the NCBI nucleotide database 
(white data points have no significant hit). Note that the map is periodic. Clusters are: 1 . Candida 
albicans; 2. Finegoldia magna; 3. Staphylococcus hominis; 4. Staphylococcus lugdunensis; 5. Leuconostoc 
citreum; 6. Peptoniphilus Carrol (novel species); 7. Staphylococcus aureus; 8. Propionibacterium Carrol 
(novel species); 8.1 Propionibacterium acnes; 9. Streptococcus; 1 0. Staphylococcus epidermidis, regions on 
the genome that are common to all strains; 10.1 strain 1 and some low abundance scaffolds that 
probably belong to rare 5. epidermidis strains; 10.3. regions unique to 5. epidermidis strain 3; 10.4. 
regions unique to 5. epidermidis strain 4; 11. Enterococcus faecalis; and 12. Anaerococcus. (Right) An 
ESOM of the Staphylococcus genomes (numbers are the same as in the left panel), their plasmids (ex- 
tension "P") and infecting phage (A: phage 1 3; B: phage 14; and C: phage 46). For 5. epidermidis strain 
1 , dark red represents segments unique to the strain, while light red represents regions common to both 
strain 1 and strain 3 (likewise dark/light green for strain 3). 



community structure in infants delivered via C-section (Dominguez- 
Bello et al. 2010). Four representatives of the Staphylococcus genus 
were identified (phylum Firmicutes, family Staphylococcaceae). 
Staphylococci are Gram-positive commensal bacteria that fre- 
quently colonize the skin and mucous 
membranes, such as the nasal cavities in 
humans and other mammals (Otto 2009). 
S. aureus appeared late, around day 22, and 
was represented by two strains. A low 
abundance phage was also recognized 
that clusters in the time series-based 
ESOM with the S. aureus strains and thus 
may infect them. S. epidermidis was rela- 
tively abundant throughout the time se- 
ries, and strain abundances fluctuated 
significantly. Two other Staphylococcus 
species, S. hominis and S. lugdunensis, were 
present in low abundances. We also iden- 
tified two species of Propionibacterium 
(phylum Actinobacteria, family Propioni- 
bacteriaceae) , a genus whose membership 
includes Gram-positive aerotolerant an- 
aerobes that are commonly found on the 
skin but have also been described in the 
gut (Mackie et al. 1999; Dworkin and 
Falkow 2006). Propionibacterium acnes is 
often associated with sebaceous follicles, 
where secreted lipids are utilized in 
a fermentation-based metabolism (Grice 
and Segre 2011) that produces the pro- 
pionate that gives this genus its name 
(Dworkin and Falkow 2006). Growth 
status (Brzuszkiewicz et al. 2011) as well 
as strain type may influence whether 



P. acnes plays a role in skin homeostasis 
or causes disease, e.g., via immunostim- 
ulation (Nagy et al. 2005). A second 
species of Propionibacterium distinct from 
P. acnes was relatively abundant early 
and late in this colonization series. We 
refer to this species as Propionibacterium 
Carrol (or more simply, Propi. Carrol). In 
this notation, Carrol is a project desig- 
nation, common to all genomes recov- 
ered in this study. A Peptoniphilus species 
Carrol (Pepto. Carrol) appears late in the 
colonization series. Peptinophilus species 
(phylum Firmicutes, family Clostridiales 
family XL Incertae sedis) are also Gram- 
positive bacteria and are typically obli- 
gate anaerobes. Notably, this suggests 
a transition from facultative anaerobes 
toward a community dominated by ob- 
ligate anaerobes. Several other genomes 
were detected at low abundances, in 
particular at the start and the end of 
the period, including Finegoldia magna, 
Anaerococcus (both are Firmicutes, Clos- 
tridiales family XL Incertae sedis), Strepto- 
coccus (Firmicutes, Streptococcaceae) , and 
P. acnes. 

Both Propi. Carrol and Pepto. Carrol 
represent species with no previously published genomes. A 16S 
rRNA-based search for related genomes revealed mostly hits from 
skin environments (Supplemental Tables S3, S4), which raises the 
possibility that both new genomes are from native skin bacteria. 
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Figure 2. Relative abundance in the community of abundant (left) and rare (right) species. Abun- 
dance was computed based on read mapping to unique regions on the assembled genomes. 
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However, the late proliferation of both organisms suggests that 
they are well-adapted to the anaerobic conditions in the gut. 

Community composition was also evaluated using 16S rRNA 
amplicon surveys (Supplemental Fig. S7 and supporting online 
material). These surveys generally reported similar bacterial species 
to those detected in the metagenomic analysis, with overall similar 
abundances. 

Similarity of genomes to published genomes and mobile 
elements 

Essentially-complete and near-complete genomes, except for those 
of Propi. Carrol and Pepto. Carrol, showed a remarkably high sim- 
ilarity to published genomes over most, but not all, of their length 
(Supplemental Table S5; Supplemental Fig. S7). All but these ex- 
ceptions had orthologs for >89% of their genes, on average, in 
every published genome of their species at >96% amino acid (aa) 
level similarity. The similarity of S. epidermidis phage genomes to 
previously reported phage genomes varied greatly. However, sev- 
eral predicted proteins shared high sequence identity with pre- 
dicted proteins in two published S. epidermidis phage genomes, 
even when no significant similarity could be detected at the DNA 
level. 

The E. faecalis, Propi. Carrol, and Pepto. Carrol genomes con- 
tain very few mobile elements. E. faecalis has two plasmids, the 
other two species have no plasmids, and all have very few genes 
encoding transposases. S. aureus has at least two plasmids and 
several transposase genes, whereas the 5. epidermidis strains have 
three to seven plasmids each and up to 15 transposase genes. 

S. epidermidis 

The disappearance of P. acnes and the late proliferation of Propi. 
Carrol, as well as changes in abundances of S. epidermidis, S. hominis, 
S. lugdunensis, and 5. aureus, illustrate the importance of species-level 
shifts during the colonization period studied here. However, finer- 
scale changes in community composition could also be docu- 
mented, illustrated best by consideration of the S. epidermidis strains. 

We identified at least three strains of S. epidermidis (Fig. 1, left). 
This finding relied heavily on the time series-based binning ap- 
proach, which allowed us to identify those regions that are com- 
mon to all strains (cluster 10) (Fig. 1, left) and those that are unique 
to the different strains (clusters 10.1, 10.3, and 10.4) (Fig. 1, left). 
Cluster 10.1 contains sequences from at least two strains (pre- 
dominantly strain 1) and a few phages. All strains have both 
unique genomic regions and plasmids (Fig. 1, right). The genomes 
of the two dominant strains, strain 1 and strain 3, were reas- 
sembled based on samples in which they were abundant (see 
Supplemental Material). The genomes were almost identical 
throughout —90% of their lengths, with only SNPs distinguishing 
them. 

We evaluated evidence that might suggest that the opposing 
abundance patterns of the two dominant strains may be related to 
the phage infecting each of them (Fig. 3). Based on abundance 
patterns, we conclude that each of the three strains was infected by 
at least one phage, and that each phage infects a single host (see 
below). Strain 3, whose proliferation after day 16 is accompanied 
by the decline of strain 1, maintains a relatively high abundance 
after that day, whereas its infecting phage 46 declines. This sug- 
gests that the strain was able to rapidly develop phage resistance, 
unlike strain 1 whose abundance remained similar to its infecting 
phage throughout the whole time period. Both strains lack a 
CRISPR system that was reported in other S. epidermidis strains 
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Figure 3. Abundance patterns of the Staphylococcus epidermidis strains 
(thick lines) and their infecting phage (thin dashed lines). All three phages 
exist primarily as prophages but also were identified as free-existing 
phages. Insertion positions are in regions that are shared by both abun- 
dant strains (and probably also by strain 4), but abundance patterns suggest 
that phage 1 3 and 14 infect strain 1, while phage 46 infects strain 3. 



(Marraffini and Sontheimer 2008); however, comparative analysis 
revealed that two genes encoding the abortive infection (Abi) 
bacteriophage system proteins were present within the strain 3 
genome but were missing from the strain 1 genome. The Abi 
bacteriophage system targets important steps in phage repro- 
duction such as replication, transcription, or reproduction and 
also leads to cell death (Labrie et al. 2010). These genes have only 
been reported in two other strains of S. epidermidis (VCU144 and 
SK135). 

The two dominant strains also differ in a number of genes 
encoding drug resistance proteins. Strain 1 contains a complete 
mec region consisting of the mecA gene (which encodes the PBP2a 
protein associated with methicillin resistance) and its two regu- 
lating genes mecRl and mecl (in this order), while strain 3 contains 
only mecA and part of mecRl but lacks mecl and the last part of 
mecRl including the PB site. Strains that carry the complete mec 
region may remain susceptible to methicillin when mecl fully re- 
presses the activation of mec A. The absence of mecl and the PB site 
of mecRl, on the other hand, results in constitutive expression of 
PBP2a leading to methicillin resistance (Petinaki et al. 2001). Strain 
1 also contains a gene encoding a multidrug antimicrobial extru- 
sion protein (MATE) that was lacking from strain 3, whereas the 
latter carries an aacA/aphD gentamicin and kanamycin resistance 
determinant (Ferretti et al. 1986) that is lacking from strain 1. Both 
strains, as well as the other staphylococci, carry several other drug 
resistance genes, which are, in some cases, also present in other 
genomes. The methicillin resistance gene femA (Hurlimann-Dalel 
et al. 1992) was found in all Staphylococcus genomes as well as the 
Propi. Carrol and E. faecalis genomes, but not the Peptoniphilus 
genome. 

Strain 1 in our data set carries a bap gene, but strain 3 does not. 
The Bap protein has been directly linked to biofilm formation in 
various staphylococci, including S. epidermidis (Tormo et al. 2005). 
This gene is also present in the genome of the biofilm-positive 
strain RP62A but is absent from the biofilm-negative strain ATCC 
12228. No toxin genes were found in either of the well-sampled S. 
epidermidis strains or in any other bacterium in the data set, except 
for S. aureus. 
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Our data provide evidence for lateral transfer of genes among 
closely related species, as several genes in S. epidermidis show un- 
expected high sequence similarity to genes of other genomes. For 
example, a complete arginine deiminase operon (one of two such 
operons in strain 1) shares 100% identity over its entire length 
with two published S. aureus genomes. In addition, this strain 
contains genes sometimes found on pathogenicity islands of 
S. aureus and other Staphylococcus species (see Supplemental Ma- 
terial). Each S. epidermidis strain has a few strain-specific trans- 
porters, including transporters for sulfate, nickel, and copper 
(strain 1), iron (transferring receptors), proline/betaine (strain 3), 
and a few transporters with uncertain function. 

S. epidermidis phage 

We assembled complete genomes of three phages and one pro- 
phage that infect at least three of the S. epidermidis strains in the 
community. All are primarily found as prophage, but three (phages 
13, 14, and 46) are also found as free-existing phage (based on 
paired-end reads that connect both sides of the phage genomes). 
Integration sites for the phages are in regions that are common to 
both strains 1 and 3 (and possibly for the other strains as well). 
Based on matching time series abundance patterns for host and 
phage (and consistent with the integration site information), we 
infer that strain 1 is infected by phages 13 and 14, whereas strain 
3 is infected by phage 46, and prophage 16 is integrated into the 
genome of strain 4 (Fig. 3). The abundance of phage 46 decreased 
dramatically after day 17, potentially due to abortive infection 
bacteriophage resistance genes in strain 3 (see above). 

Phages 13 and 14 are closely related and share some sequence 
similarity to phage 46 (Supplemental Fig. S8). Phage 46 contains 
eight genes (all hypothetical) with no similarity to genes of any 
phage in any database. The genome of prophage 16 exhibited no 
similarity to any published genome throughout its entire length, 
except for a gene encoding a transposase. Interestingly, many of its 
predicted proteins showed high similarity, usually 90% identity or 
more, to published protein sequences. 

We compared the genomes of the three S. epidermidis phages 
and the prophage to the genomes of two previously published 
S. epidermidis phages, PH15 and CNPH82 (Supplemental Fig. S10; 
Daniel et al. 2007). The genomes of the three phages exhibit sim- 
ilarity in terms of overall length and gene synteny (see Supple- 
mental Material for a detailed discussion). Based on gene content, 
the three new genomes belong to the Siphoviridae family. Phage 
46 differs from the other phage in a number of ways. In addition to 
differences in gene content, predicted tape measure protein (TMP) 
lengths (Katsura 1987) indicate that the tail of phage 46 is longer 
than all others in the comparison set. 

A few short phage fragments are also clustered with the 
S. epidermidis strains, indicating that other low abundance phages 
were present in the data sets. All fragments exhibit some similarity 
to one of the sequenced phages. 

Comparative analysis of two Propionibacterium species 

We recovered essentially-complete and partial genomes of Propio- 
nibacterium species. Based on the partial genome, the rare P. acnes 
strain found only in the first two samples analyzed (Fig. 4, left) is 
most similar to P. acnes strain SK137 (Fig. 4, right). The dominant 
Propionibacterium species, Propi. Carrol, is related to both SKI 3 7 
and Propionibacterium humerusii, which was isolated from a human 
patient undergoing revision of a shoulder arthroplasty (Butler- Wu 
et al. 2011). 16S rRNA gene identity relative to both P. acnes SK137 



and P. humerusii P08 is 96.6%, (average aa identity for putative 
orthologous proteins 86.6%). The P. acnes SK137, P. humerusii, and 
Propi. Carrol genomes were aligned so that their gene content 
could be compared. The three genomes are essentially syntenous, 
so alignments highlight gene insertion/deletion events and assist 
in distinguishing orthologs from nonorthologous (nonsyntenous) 
genes that share sufficient sequence similarity to be brought into 
the analysis. 

In all comparisons, many regulatory and transport genes are 
genotype-specific. In the case of transport genes, only Propi. Carrol 
has two ammonium transport genes and an adjacent gene for 
a nitrogen regulatory protein. This species is also the only one with 
transport and other genes required for arsenic resistance. In con- 
trast, the P. acnes SK137 is enriched in genes coding for transporters 
for organic compounds such as glucitol/sorbitol (A,B,C compo- 
nents and glucitol operon activator protein) and cellobiose. Given 
this, it is interesting that SK137 encodes a larger number of gly- 
cosyl hydrolases (GH) than was found in the other two propioni- 
bacteria used in the comparison (families 1, 2, 3, 18, 20, 25, 31, 35, 
85). Further, the representation of GH in the three genomes is 
markedly different. SKI 3 7 encodes a larger number of glycosyl 
hydrolases than was found in the other two propionibacteria and is 
enriched relative to Propi. Carrol in genes for compounds such as 
chitobiose, chitodextrins, and oligosaccharides (see Supplemental 
Materials). 

Propi .Carrol and P. acnes SKI 3 7 differ in their sialic acid-re- 
lated genes. P. acnes SK137 carries genes for sialic acid utilization, 
including a gene for a sialic acid transporter, three sialidases 
(neuraminidases), and for interconversion of N-acyl-D-glucosamine 
6-phosphate and N-acyl-D-mannosamine 6-phosphate. The sialidase 
genes (EC:3.2.1.18) are absent from Propi. Carrol. Sialidase was also 
reported in other strains of P. acnes (Bruggemann et al. 2004) and 
is used for the degradation of host tissues (Nakatsuji et al. 2008). 
Propi. Carrol, on the other hand, has genes for sialic acid bio- 
synthesis. Interestingly, cell surface sialic acid capsule production 
appears to be relatively rare in Gram-positive bacteria (Severi et al. 
2007). The ability to produce this nine-carbon sugar acid may be 
important, as it is the predominant and terminal acid in complex 
glycans on mucins and glycoproteins associated with vertebrate 
cell membranes (where it may function in, for example, in- 
tercellular adhesion and cell signaling). It also may be produced in 
the gut in the course of inflammation (Severi et al. 2007). 

Propi. Carrol is apparently unable to utilize myo-inositol. 
Inositol plays a role in eukaryotic cell messaging and can be de- 
rived from nutritional sources. In fact, breast milk (especially co- 
lostrum) is rich in inositol (Pereira et al. 1990). Infant formula 
contains less inositol, and solutions for intravenous nutrition 
generally lack it. Inositol diet supplementation for premature in- 
fants receiving parenteral nutrition has shown benefits for treat- 
ment of infants with respiratory difficulties (Hallman et al. 1992). 
It is intriguing that the dominant Propionibacterium species in the 
gut community studied here is apparently unable to utilize inosi- 
tol. It is also apparently unable to use carnitine, a compound re- 
quired for fatty acid utilization in eukaryotes. Carnitine is also 
a source of osmoprotectant (it dehydrates to form crotonobetaine), 
and can be utilized by microorganisms as a source of C and N and 
serve as an external electron acceptor in anaerobic respiration 
(Walt and Kahn 2002). Thus, comparative analyses suggest that 
utilization of carnitine by propionibacteria might only be impor- 
tant early in the gut colonization of the infant. Also intriguing is 
the lack of genes for molybdopterin biosynthesis and anaerobic 
dimethyl sulfoxide reductase and nitrate reductase complexes in 
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Figure 4. Abundance patterns of Propionibacterium-re\ated genomes and their alignment against P. acnes SKI 37. (Left) Abundance of Propi. Carrol (red) 
and P. acnes (blue) as well as one Propionibacterium phage (green). The phage could not be linked to any of the species, either through integration site or 
through its abundance pattern. (Right) Alignment of both Propi. Carrol and P. acnes species to the genome of the skin P. acnes SKI 37. 



the genome of Propi. Carrol. The absence of these genes may in- 
dicate a reduced capacity for anaerobic respiration in the Propi. 
Carrol genotype and is consistent with an overall transition in the 
community toward fermentation-based metabolism. 

Interestingly the Propi. Carrol genotype has genes to synthesize 
nicotinamide adenine dinucleotide (NAD), in contrast to SK137. 
Unlike SKI 3 7, it also has genes involved in glycerophospholipid 
metabolism, amino acid biosynthesis, terpenoid biosynthesis, 
maltose sugar transformation, Von Willebrand factor production 
(possibly involved in attachment), and diacylglycerol kinase syn- 
thesis that may convert diacylglycerol (DAG) to phosphatidic 
acid (PA), a compound possibly used for (lipid-based) signaling. 

Notably, Propi. Carrol contains a CRISPR/Cas locus involved 
in phage resistance that is distinct from that in the P. humerusii 
genome and is present in some P. acnes strains but not in SKI 3 7. No 
spacer (crRNA) encoded by this locus matches genomically sam- 
pled Propionibacterium phage; thus, the locus probably does not 
confer resistance to this phage. 

We also conducted comparative genomic analysis of Pepto. 
Carrol and the previously sequenced Peptoniphilus harei ACS- 146- 
V-Sch2b. Details are provided in the Supplemental Material. 

Discussion 

We documented species and strain abundance variations, as well as 
associated genomic characteristics, during the early stages of mi- 
crobial colonization of the human gut. At the strain level, we 
detected at least three strains of S. epidermidis, two of which were 
sufficiently abundant for their genomes to be completely assem- 
bled. It has been previously suggested that multiple strains provide 
the necessary genetic versatility that allows a species to withstand 
selective pressures such as phage blooms (Ley et al. 2006). The data 
presented in this paper, showing dramatic changes in S. epidermidis 
strain abundance, correlated with changes in hostiphage abun- 
dance, may indicate that strain variation is, indeed, important for 
species stability as phage infectivity changes. Notably, the shift in 
the hostiphage ratio for strain 3 from —1:1 to 10:1 around day 
17 may suggest a change in either host resistance or phage in- 
fectivity, possibly associated with activation of the abortive in- 
fection bacteriophage system. The difference in genes encoding 
drug resistance, transporters, and regulatory proteins in strains 1 
and 3 could also impact their relative abundances. Overall, these 
strains differ in —10% of their genes, some of which have similar 
homologs in other staphylococci, in particular S. aureus, but not in 



any published S. epidermidis genomes. This finding is consistent 
with previous reports (Chan et al. 2011), and suggests that other 
species of Staphylococcus can contribute to the gene pool that re- 
sults in S. epidermidis strain variation. 

The microbial community in our data set is mostly dominated 
by E. faecalis as well as skin-associated bacterial genera such as 
Propionibacterium (two species), Staphylococcus (four species and 
multiple strains), and Peptoniphilus. The early detection of a P. acnes 
strain is interesting and may suggest the derivation of this early 
colonist from the skin of a parent or caregiver. This is in line with 
previously observed microbial community structure in infants 
born by C-section (Dominguez-Bello et al. 2010). The proliferation 
of the novel Propionibacterium and Peptoniphilus species in later 
stages suggests that they are better adapted to the gut environment 
than the other early colonizers that did not proliferate. Compar- 
ative analysis reveals numerous differences between the Propi. 
Carrol and SKI 3 7 genomes, and we infer that these genotypic 
differences contribute to the apparently increased fitness of Propi. 
Carrol within the gut environment (Fig. 5). These differences in- 
clude genes encoding different transporters, glycosyl hydrolases, 
and molecule utilization and resistance to metals and arsenic. One 
notable example of a strategy that differentiates Propionibacterium 
types involves genes related to sialic acid: P. acnes SKI 3 7 carries 
genes for sialic acid utilization, while Propi. Carrol carries genes for 
sialic acid biosynthesis. Production of cell surface sialic acid is 
known in pathogenic bacteria to allow the cells to evade the host's 
immune response (Severi et al. 2007). Thus, the transition from 
P. acnes to Propi. Carrol genotype in the infant studied here may be 
associated with a switch in the role of propionibacteria to pro- 
duction of sialic acid rather than consumption. This change could 
potentially alter the pathogenic significance of these organisms. 

Until recently, sequencing capacities were only sufficient to 
enable the recovery of near-complete genomes from systems in 
which the dominant organisms were highly abundant (Tyson et al. 
2004). Genome recovery has been accomplished in the gut envi- 
ronment but only for the most dominant genomes in low com- 
plexity communities (Morowitz et al. 2011). The introduction of 
high-throughput sequencing technologies provided the oppor- 
tunity to scale these methods for analysis of more complex sys- 
tems. Recently, the possibility of extracting one complete or near- 
complete genome from a high-throughput metagenomic data set 
was demonstrated (Iverson et al. 2012). The current study differs 
from these previous studies, first, in that eight essentially com- 
plete and near-complete microbial genomes were recovered and, 
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with broad spectrum antibiotics (ampi- 
cillin/gentamicin) during the first 2 d of 
life but did not receive antibiotics during 
the remainder of the study period. She 
received fortified maternal breast milk as 
well as supplemental parenteral nutrition 
until caloric intake from enteral nutrition 
was adequate (post-natal day 20). The 
infant required mechanical ventilation 
during the first 6 wk of life due to respi- 
ratory distress syndrome and broncho- 
pulmonary dysplasia but ultimately was 
weaned from mechanical ventilation and 
supplemental oxygen therapy. She was 
discharged to home at 3 mo of age. 



Arsenite 



Figure 5. Comparison of selected functionalities that differentiate Propi. Carrol and P. acnes SKI 37. 



second, in the comprehensiveness of the analysis: 96% of the 
reads were assembled into >500-bp scaffolds and binned. In fact, 
the vast majority of scaffolds were from genomes whose overall 
relative abundance was >0.05% in the data set, and these were 
accurately binned to the organism of origin. Time series-based 
binning proved to be both accurate and sensitive because sample 
compositions varied, and genomes had unique abundance pat- 
terns. This approach should be generally applicable for the bin- 
ning of metagenomic data collected over a time course, even for 
more complex environments. Potential applications of this method 
may include samples collected along geochemical or other gradi- 
ents, from different environments, or over longer periods of time, 
so long as the samples share the same genomes. However, the 
application will be limited if samples contain different strains of 
the same species. 

Significant effort was invested in assuring the accuracy of the 
assembled genomes. This was achieved using new bioinformatic 
tools and manual curation guided by new programs. The resulting 
genomes are of very high quality, superior to draft genomes gen- 
erated in most isolate genome sequencing projects. Overall, this 
study has demonstrated the potential for community genomic 
analyses to uncover, at high resolution, the patterns of organism 
abundance, phage abundance, and organism physiology that are 
important during establishment of the human microbiome. 

Methods 

Data analysis process overview 

Figure 6 outlines the data analysis pipeline that was developed for 
this study. We provide here a concise description of the different 
steps. For a complete description, refer to the supporting online 
material. 

Data acquisition 

Medical background of human subject 

The subject of this study is a female infant who was delivered by 
Caesarean section at 26 wk of gestation at the Comer Children's 
Hospital at the University of Chicago. She was treated empirically 



Sample collection 

Eleven fecal samples were collected be- 
tween days 15 to 24 and sequenced on 
three lanes of an Illumina HiSeq2000 se- 
quencer for 101 cycles from each end 
using TruSeq SBS sequencing kits version 
2. Data were analyzed with pipeline 1.7 
according to the manufacturer's instruc- 
tions (Illumina). Overall, 254 million high- 
quality reads (24.5 Gbp) remained after trimming and removing 
human reads (refer to Supplemental Table S2 and Methods in the 
Supplemental Material). 

Data preparation 

Data were trimmed using an in-house script that removes all 
consecutive bases starting from the 3 ' end of each read with quality 
values less than 7. Reads with <60 bp remaining after trimming 
were discarded. Average insert size and insert size standard de- 
viation for each sample, which are required for the assembly, were 
determined based on mapping of reads from each sample to 
a preliminary, nonoptimized assembly. 

Assembly 

Data from all 1 1 samples were assembled together. We used a cov- 
erage-directed iterative approach that is based on separate assem- 
bly of different coverage bins, i.e., coverage ranges that include one 
or more genomes. This approach makes it possible to choose the 
optimal set of parameters (coverage-related parameters and k-mer 
size) for each coverage bin. For each iteration, assembly parameters 
that fit the coverage bin with the highest coverage remaining are 
used, and the data assembled using Velvet (Zerbino and Birney 
2008) (ABySS [Simpson et al. 2009] was used for the last iteration). 
Scaffolds that fit only the coverage range that was defined for the 
current iteration are added to the overall resulting assembly, and 
the reads that map to these scaffolds are identified (using Bowtie 
[Langmead et al. 2009]) and removed from further analysis. 

Assembly quality control 

Misassemblies were detected based on the identification of posi- 
tions that were not covered by any insert whose zero insert cov- 
erage was statistically significant. The significance of such posi- 
tions was determined using a simple statistical framework that is 
based on Lander and Waterman (1988). Scaffolds containing such 
positions were split. Assembly gaps, represented by stretches of N ; s 
that were added as part of the scaffolding process in the assembly 
stage, were filled whenever possible based on the assembly of reads 
that were mapped to the edges of the gaps and their paired-end 
mates. Separate such assemblies were generated for both edges of 
each N-segment, and their consistency was confirmed. 
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Figure 6. Analysis pipeline for metagenomic data employed in this study. Refer to the Methods section and the supporting online material for more 
details. 



Binning 

We used ESOM with time series abundance profiles as follows. All 
scaffolds longer than 3 Kbp were broken into 3-Kbp (nonterminal) 
segments. Any remaining sequence was added to the adjacent 
3-Kbp segment, forming sequences of up to 6 Kbp (terminal). 
These, together with the rest of the scaffolds that are longer than 
500 bp, correspond to data points in the ESOM. Coverage was 
computed for each segment in each sample based on the number 
of reads from the sample that were mapped to it. These coverages 
were normalized twice: first, by the sum of reads for that sample 
and then by the sum of the segment's coverages across the 11 
samples. The second normalization is required by ESOM in order to 
eliminate large differences in coverage that may skew the output. 
Parameters for ESOM were chosen based on Dick et al. (2009), 
clusters were extracted manually from the map shown in Figure 1, 
left. Data points were labeled and colored based on best blast hit of 
each data point's scaffold from the NCBI nucleotide database. 

Assembly optimization 
Manual curation 

All complete and essentially complete genomes (Table 1) were 
manually curated in order to extend scaffolds, detect misassemblies, 
and verify assembly completeness. Examination of scaffold edges 
or suspect regions within the scaffolds was done based on local as- 
semblies of reads that mapped to the region of interest and their 
paired-end mates. 

Reassembly 

For the genomes of S. epidermidis strains 1 and 3, we identified 
samples in which each of the strains was most abundant. These 
genomes were reassembled separately based on reads (and their 



mates) from these samples that were mapped to S. epidermidis 
scaffolds generated in the initial assembly. 

Genome completeness 

Completeness of all genomes that were not assembled into one 
piece was evaluated based on the following three indicators: (1) 
Presence of a set of 60 single-copy genes (Supplemental Fig. S9); (2) 
scaffold connectivity — genomes containing at least 95% of the 
single-copy genes went through connectivity inspection, in which 
scaffold ends were verified to be connected to other scaffolds in the 
same genome; and (3) a rough estimation for the completeness for 
the rest of the genomes was estimated based on size comparison to 
closely related genomes. 

Gene prediction 

Prodigal (Hyatt et al. 2010) was used for gene prediction, and an 
in-house pipeline was used for functional annotation. The pipe- 
line includes BLAST-based similarity searches (against NR, KEGG, 
UniRef 90, COG) and HMM-based functional domain recognition 
searches (Quevillon et al. 2005). 

Downstream analyses 

ggKBase (see below) was used for metabolic analysis of the different 
genomes. 

Data access 

The entire data set is publicly available via an open knowledgebase 
(ggKBase: http://ggkbase.berkeley.edu/carrol/). Assemblies, genes, 
and predicted proteins are available at http://ggkbase.berkeley. 
edu/carrol/download/index.html. Read data sets were deposited in 
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Table 1. Information for assembled genomes 
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Staphylococcus epidermidis strain 4 b 
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Staphylococcus aureus strain 2 C 
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Candida albicans 
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a Genome size is shorter than most strains. 

information represents regions unique to 5. epidermidis strain 4. 

information represents regions unique to 5. aureus strain 2. 

d 1, 348,806 non-N bp. 

e 207,961 non-N bp. 

f 1 72,692 non-N bp. 

Estimated based on total number of reads that were mapped to 5. epidermidis strains 1 and 3 and the ratio between the number of reads that were 
mapped to unique regions in the two genomes. 

Estimated based on total number of reads that were mapped to phages 1 3 and 1 4 and the ratio between reads that were mapped to unique regions in the 
two genomes. 



the NCBI Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih. 
gov/sra) under accession number SRA052203. 
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